Speech recognition apparatus, speech recognition method, and computer-readable recording medium in which speech recognition program is recorded

ABSTRACT

A speech recognition apparatus comprises: a speech analyzer which extracts feature patterns of spontaneous speech divided into frames; a keyword model database which prestores keyword models which represent feature patterns of a plurality of keywords to be recognized; a garbage model database which prestores feature patterns of components of extraneous speech to be identified; a first likelihood calculator which calculates likelihood of feature values based on the feature patterns of each frame and the keywords; and a second likelihood calculator which calculates likelihood of feature values based on the feature patterns of each frame and the extraneous speech. The apparatus recognizes keywords contained in the spontaneous speech by calculating cumulative likelihood based on the calculated likelihoods, with a predetermined correction value added in the second likelihood calculator.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to a technical field regarding speech recognition by an HMM (Hidden Markov Model) method and, particularly, to a technical field regarding recognition of keywords from spontaneous speech.

[0003] 2. Description of the Related Art

[0004] In recent years, speech recognition apparatuses have been developed which recognize spontaneous speech uttered by a person.

[0005] When a person speaks predetermined words, these devices recognize the spoken words from their input signals.

[0006] For example, various devices equipped with such a speech recognition apparatus, such as a navigation system mounted in a vehicle for guiding the movement of the vehicle, or a personal computer, allow the user to enter various information without the need for manual keyboard or switch selecting operations.

[0007] Thus, for example, the operator can enter desired information in the navigation system even in a working environment where the operator is driving the vehicle with both hands.

[0008] Typical speech recognition methods include a method which employs probability models known as HMMs (Hidden Markov Models).

[0009] In the speech recognition, the spontaneous speech is recognized by matching patterns of feature values of the spontaneous speech with patterns of feature values of speech which are prepared in advance and represent candidate words called keywords.

[0010] Specifically, in the speech recognition, feature values of inputted spontaneous speech (input signals) divided into segments of a predetermined duration are extracted by analyzing the inputted spontaneous speech, the degree of match (hereinafter referred to as likelihood) between the feature values of the input signals and feature values of keywords represented by HMMs prestored in a database is calculated, the likelihood is accumulated over the entire spontaneous speech, and the keyword with the highest likelihood is decided as the recognized keyword.

[0011] Thus, in the speech recognition, the keywords are recognized based on the input signals, which are spontaneous speech uttered by a person.

[0012] Incidentally, an HMM is a statistical source model expressed as a set of transitioning states. It represents feature values of predetermined speech to be recognized, such as a keyword. Furthermore, the HMM is generated based on a plurality of speech data sampled in advance.
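The matching procedure described in paragraphs [0010] to [0012] can be illustrated with a minimal sketch. It assumes one-dimensional feature values, a left-to-right HMM whose states each emit from a unit-variance Gaussian, and equal stay/advance transition probabilities; the names `log_gauss`, `viterbi_log_likelihood`, `recognize`, and the toy keyword models are hypothetical and are not taken from the specification.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, used here as a state's output model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_log_likelihood(frames, means, var=1.0):
    """Cumulative log-likelihood of the best path through a left-to-right HMM.

    Assumption: each state either stays or advances with probability 0.5,
    and the path must start in the first state and end in the last one.
    """
    log_stay = log_move = math.log(0.5)
    neg_inf = float("-inf")
    score = [neg_inf] * len(means)
    score[0] = log_gauss(frames[0], means[0], var)
    for x in frames[1:]:
        prev = score
        score = []
        for s, mean in enumerate(means):
            stay = prev[s] + log_stay
            move = prev[s - 1] + log_move if s > 0 else neg_inf
            score.append(max(stay, move) + log_gauss(x, mean, var))
    return score[-1]

def recognize(frames, keyword_models):
    """Accumulate likelihood per keyword and pick the keyword with the highest."""
    scores = {name: viterbi_log_likelihood(frames, means)
              for name, means in keyword_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical keyword models: one mean per HMM state.
models = {"tokyo": [0.0, 2.0, 4.0], "osaka": [5.0, 3.0, 1.0]}
best, scores = recognize([0.1, 0.0, 2.1, 1.9, 4.0], models)
print(best)
```

Working in the log domain turns the accumulation over frames into a sum, which avoids numerical underflow on long utterances.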

[0013] In such speech recognition, it is important how keywords contained in spontaneous speech are extracted.

[0014] Besides keywords, spontaneous speech generally contains extraneous speech, i.e. previously known words that are unnecessary for recognition (words such as “er” or “please” before and after keywords), and in principle, spontaneous speech consists of keywords sandwiched by extraneous speech.

[0015] Conventionally, speech recognition often employs “word-spotting” techniques to recognize keywords to be speech-recognized.

[0016] In the word-spotting techniques, not only HMMs which represent keyword models but also HMMs which represent extraneous-speech models (hereinafter referred to as garbage models) are prepared, and spontaneous speech is recognized by identifying the keyword model, garbage model, or combination thereof whose feature values have the highest likelihood.
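As a rough illustration of word spotting with a garbage model, the sketch below scores each keyword as a chain "garbage, keyword states, garbage" and selects the chain with the highest cumulative log-likelihood. It is a simplified assumption rather than the implementation of any particular recognizer: features are one-dimensional, the single garbage model is one broad Gaussian, transition probabilities are treated as uniform and omitted, and all names and numeric values are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, used here as a state's output model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def chain_score(frames, states):
    """Best-path log-likelihood through a left-to-right chain of (mean, var) states.

    The path must start in the first state and end in the last one.
    Transition probabilities are treated as uniform and omitted.
    """
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    score[0] = log_gauss(frames[0], *states[0])
    for x in frames[1:]:
        prev = score
        score = []
        for i, (mean, var) in enumerate(states):
            best = max(prev[i], prev[i - 1]) if i > 0 else prev[i]
            score.append(best + log_gauss(x, mean, var))
    return score[-1]

def spot_keyword(frames, keyword_models, garbage=(0.0, 25.0)):
    """Score 'garbage, keyword, garbage' for every keyword and pick the best."""
    scores = {}
    for name, means in keyword_models.items():
        states = [garbage] + [(m, 1.0) for m in means] + [garbage]
        scores[name] = chain_score(frames, states)
    return max(scores, key=scores.get), scores

# Hypothetical utterance: extraneous frames, then a keyword, then extraneous.
models = {"tokyo": [0.0, 2.0, 4.0], "osaka": [5.0, 3.0, 1.0]}
best, scores = spot_keyword([-6.0, 7.0, 0.0, 2.0, 4.0, -5.0], models)
print(best)
```

Because the garbage model is deliberately broad, it fits every extraneous frame only loosely, which is the source of the low accumulated likelihoods discussed in the summary that follows.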

SUMMARY OF THE INVENTION

[0017] Generally, keywords are recognized by identifying a plurality of pieces of extraneous speech using one HMM which is generated based on a plurality of speech segments. However, because a plurality of pieces of extraneous speech are identified by using one HMM, relatively low likelihoods are accumulated. Accordingly, the device for recognizing spontaneous speech described above is prone to misrecognition.

[0018] The present invention has been made in view of the above problems. Its object is to provide a speech recognition apparatus which can achieve high speech recognition performance without increasing the data quantity of feature values of extraneous speech.

[0019] The above object of the present invention can be achieved by a speech recognition apparatus of the present invention. The speech recognition apparatus for recognizing at least one of the keywords contained in uttered spontaneous speech is provided with: an extraction device for extracting a spontaneous-speech feature value, which is a feature value of speech ingredient of the spontaneous speech, by analyzing the spontaneous speech; a database in which at least one keyword feature data indicating a feature value of speech ingredient of the keyword and at least one extraneous-speech feature data indicating a feature value of speech ingredient of extraneous speech are prestored; a calculation device for calculating likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the keyword feature data and the extraneous-speech feature data; and a determining device for determining at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood, wherein the calculation device calculates the likelihood by using a predetermined correction value when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0020] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature value and the extraneous-speech feature data adjusted by a predetermined correction value, and at least one of the keywords and the extraneous speech to be recognized is determined based on the calculated likelihood.

[0021] Accordingly, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, or due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the predetermined correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.
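The effect of a correction value on the extraneous-speech likelihood can be sketched as follows. The sketch assumes that likelihoods are kept in the log domain and that the correction is simply added to the garbage-state score of every frame; the specification states only that a predetermined correction value is used, so this additive form, the one-dimensional Gaussian models, and the names `spot` and `correction` are illustrative assumptions. Each score is paired with the number of frames assigned to keyword states so that the change in segmentation is visible.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, used here as a state's output model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def spot(frames, keyword_means, garbage=(0.0, 25.0), correction=0.0):
    """Best 'garbage, keyword, garbage' path with a corrected garbage score.

    Returns (cumulative log-likelihood, number of frames assigned to the keyword).
    Assumptions: unit-variance keyword states, uniform transitions (omitted),
    and `correction` added to the garbage log-likelihood of every frame.
    """
    states = ["G"] + list(keyword_means) + ["G"]

    def emit(x, state):
        if state == "G":
            return log_gauss(x, *garbage) + correction, 0
        return log_gauss(x, state, 1.0), 1

    # Each entry is (score, keyword_frame_count); tuples compare by score first.
    best = [emit(frames[0], states[0])] + [(float("-inf"), 0)] * (len(states) - 1)
    for x in frames[1:]:
        nxt = []
        for i, state in enumerate(states):
            prev = max(best[i], best[i - 1]) if i > 0 else best[i]
            ll, is_keyword = emit(x, state)
            nxt.append((prev[0] + ll, prev[1] + is_keyword))
        best = nxt
    return best[-1]

frames = [-6.0, 0.0, 2.0, 4.0, 6.5, -5.0]   # hypothetical feature values
print(spot(frames, [0.0, 2.0, 4.0], correction=0.0)[1])
print(spot(frames, [0.0, 2.0, 4.0], correction=-1.0)[1])
```

In this toy input the frame with value 6.5 is ambiguous: with no correction it is absorbed by the garbage model, while a negative correction makes the garbage model less attractive and the keyword segment grows by one frame. The sign and size of a useful correction depend on the models and are not fixed by this sketch.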

[0022] In one aspect of the present invention, the speech recognition apparatus of the present invention is further provided with: a setting device for setting the correction value based on the noise level around where the spontaneous speech is uttered, wherein the calculation device calculates the likelihood by using the set correction value when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0023] According to the present invention, the correction value is set based on the noise level around where the spontaneous speech is uttered, and the likelihood is calculated based on the feature values of the extracted spontaneous speech, the extraneous-speech feature data adjusted by the set correction value, and the acquired keyword feature data.

[0024] Accordingly, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.
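One way a setting device might derive the correction value from the ambient noise level is sketched below. The passage above does not give the mapping, so everything here is a hypothetical assumption: the linear interpolation, the decibel thresholds `quiet_db` and `loud_db`, the bound `max_correction`, and the function name `correction_from_noise`.

```python
def correction_from_noise(noise_db, quiet_db=40.0, loud_db=80.0, max_correction=-2.0):
    """Map a measured noise level to a correction value (hypothetical mapping).

    Below quiet_db no correction is applied; above loud_db the full
    max_correction is applied; in between the value is interpolated linearly.
    """
    t = (noise_db - quiet_db) / (loud_db - quiet_db)
    t = min(1.0, max(0.0, t))          # clamp to the range [0, 1]
    return t * max_correction

for level in (30.0, 60.0, 90.0):
    print(level, correction_from_noise(level))
```

A lookup table calibrated on recordings made at known noise levels would serve the same purpose; the linear form is chosen here only for brevity.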

[0025] In one aspect of the present invention, the speech recognition apparatus of the present invention is further provided with: a setting device for setting the correction value based on the ratio between the duration of the determined keyword and the duration of the spontaneous speech when the determining device determines at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood, wherein the calculation device calculates the likelihood by using the set correction value when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0026] According to the present invention, the correction value is set based on the ratio between the duration of the determined keyword and the duration of the spontaneous speech, and the likelihood is calculated based on the feature values of the extracted spontaneous speech, the extraneous-speech feature data adjusted by the set correction value, and the acquired keyword feature data.

[0027] Accordingly, even under conditions in which misrecognition could occur due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.
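Setting the correction value from the duration ratio can be sketched as a simple feedback rule. The passage above specifies only that the ratio between keyword duration and utterance duration is used; the thresholds `low` and `high`, the step size, the direction of each adjustment, and the name `update_correction` are hypothetical assumptions, with durations counted in frames.

```python
def update_correction(correction, keyword_frames, total_frames,
                      low=0.2, high=0.8, step=0.5):
    """Adjust the correction value from the keyword/utterance duration ratio.

    Hypothetical rule: if the determined keyword occupies an implausibly small
    share of the utterance, the garbage model is assumed to be absorbing too
    many frames and is penalized; if the share is implausibly large, the
    penalty is relaxed. Otherwise the correction value is left unchanged.
    """
    ratio = keyword_frames / total_frames
    if ratio < low:
        return correction - step
    if ratio > high:
        return correction + step
    return correction

print(update_correction(0.0, keyword_frames=1, total_frames=10))
print(update_correction(0.0, keyword_frames=5, total_frames=10))
```

The updated value would then be used the next time the extraneous-speech likelihood is calculated.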

[0028] In one aspect of the present invention, in the speech recognition apparatus of the present invention, the extraneous-speech feature data prestored in the database has data of feature values of speech ingredient of a plurality of the extraneous speech.

[0029] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature values, the adjusted extraneous-speech feature data which has data of feature values of speech ingredient of a plurality of the extraneous speech, and the acquired keyword feature data.

[0030] Accordingly, since the likelihood is calculated based on data of feature values of speech ingredient of a plurality of the extraneous speech, it is possible to identify the extraneous speech properly using a small amount of data in recognizing the extraneous speech. Furthermore, even under conditions in which misrecognition could occur due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0031] In one aspect of the present invention, in the speech recognition apparatus of the present invention, in a case where extraneous-speech component feature data indicating a feature value of speech ingredient of an extraneous-speech component, which is a component of the extraneous speech, is prestored in the database, the calculation device calculates the likelihood based on the extraneous-speech component feature data when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data, and the determining device determines at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood.

[0032] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature value, the adjusted extraneous-speech component feature data and the acquired keyword feature data, and at least one of the keywords to be recognized and the extraneous speech is determined based on the calculated likelihood.

[0033] Accordingly, since the extraneous speech and the keyword are identified by calculating the likelihood based on the adjusted extraneous-speech component feature data, the extraneous speech can be identified properly by using a small amount of data in recognizing the extraneous speech. Therefore, it is possible to increase the range of identifiable extraneous speech without increasing the amount of data required to recognize extraneous speech, and to improve the accuracy with which the keyword is extracted and recognized.

[0034] Furthermore, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, or due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the predetermined correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0035] The above object of the present invention can be achieved by a speech recognition method of the present invention. The speech recognition method of recognizing at least one of the keywords contained in uttered spontaneous speech is provided with: an extraction process of extracting a spontaneous-speech feature value, which is a feature value of speech ingredient of the spontaneous speech, by analyzing the spontaneous speech; an acquiring process of acquiring at least one keyword feature data indicating a feature value of speech ingredient of the keyword and at least one extraneous-speech feature data indicating a feature value of speech ingredient of extraneous speech, the keyword feature data and the extraneous-speech feature data being prestored in a database; a calculation process of calculating likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the keyword feature data and the extraneous-speech feature data; and a determination process of determining at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood, wherein the calculation process calculates the likelihood by using a predetermined correction value when the calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0036] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature value and the extraneous-speech feature data adjusted by a predetermined correction value, and at least one of the keywords and the extraneous speech to be recognized is determined based on the calculated likelihood.

[0037] Accordingly, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, or due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the predetermined correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0038] In one aspect of the present invention, the speech recognition method of the present invention is further provided with: a setting process of setting the correction value based on the noise level around where the spontaneous speech is uttered, wherein the calculation process calculates the likelihood by using the set correction value when the calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0039] According to the present invention, the correction value is set based on the noise level around where the spontaneous speech is uttered, and the likelihood is calculated based on the feature values of the extracted spontaneous speech, the extraneous-speech feature data adjusted by the set correction value, and the acquired keyword feature data.

[0040] Accordingly, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0041] In one aspect of the present invention, the speech recognition method of the present invention is further provided with: a setting process of setting the correction value based on the ratio between the duration of the determined keyword and the duration of the spontaneous speech when the determination process determines at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood, wherein the calculation process calculates the likelihood by using the set correction value when the calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0042] According to the present invention, the correction value is set based on the ratio between the duration of the determined keyword and the duration of the spontaneous speech, and the likelihood is calculated based on the feature values of the extracted spontaneous speech, the extraneous-speech feature data adjusted by the set correction value, and the acquired keyword feature data.

[0043] Accordingly, even under conditions in which misrecognition could occur due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0044] In one aspect of the present invention, in the speech recognition method of the present invention, the extraneous-speech feature data prestored in the database has data of feature values of speech ingredient of a plurality of the extraneous speech.

[0045] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature values, the adjusted extraneous-speech feature data which has data of feature values of speech ingredient of a plurality of the extraneous speech, and the acquired keyword feature data.

[0046] Accordingly, since the likelihood is calculated based on data of feature values of speech ingredient of a plurality of the extraneous speech, it is possible to identify the extraneous speech properly using a small amount of data in recognizing the extraneous speech. Furthermore, even under conditions in which misrecognition could occur due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0047] In one aspect of the present invention, in the speech recognition method of the present invention, in a case where extraneous-speech component feature data indicating a feature value of speech ingredient of an extraneous-speech component, which is a component of the extraneous speech, is prestored in the database, the calculation process calculates the likelihood based on the extraneous-speech component feature data when the calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data, and the determination process determines at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood.

[0048] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature value, the adjusted extraneous-speech component feature data and the acquired keyword feature data, and at least one of the keywords to be recognized and the extraneous speech is determined based on the calculated likelihood.

[0049] Accordingly, since the extraneous speech and the keyword are identified by calculating the likelihood based on the adjusted extraneous-speech component feature data, the extraneous speech can be identified properly by using a small amount of data in recognizing the extraneous speech. Therefore, it is possible to increase the range of identifiable extraneous speech without increasing the amount of data required to recognize extraneous speech, and to improve the accuracy with which the keyword is extracted and recognized.

[0050] Furthermore, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, or due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the predetermined correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0051] The above object of the present invention can be achieved by a recording medium of the present invention. The recording medium is a recording medium wherein a speech recognition program is recorded so as to be read by a computer, the computer being included in a speech recognition apparatus for recognizing at least one of the keywords contained in uttered spontaneous speech, the program causing the computer to function as: an extraction device for extracting a spontaneous-speech feature value, which is a feature value of speech ingredient of the spontaneous speech, by analyzing the spontaneous speech; an acquiring device for acquiring at least one keyword feature data indicating a feature value of speech ingredient of the keyword and at least one extraneous-speech feature data indicating a feature value of speech ingredient of extraneous speech, the keyword feature data and the extraneous-speech feature data being prestored in a database; a calculation device for calculating likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the keyword feature data and the extraneous-speech feature data; and a determining device for determining at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood, wherein the calculation device calculates the likelihood by using a predetermined correction value when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0052] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature value and the extraneous-speech feature data adjusted by a predetermined correction value, and at least one of the keywords and the extraneous speech to be recognized is determined based on the calculated likelihood.

[0053] Accordingly, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, or due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the predetermined correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0054] In one aspect of the present invention, the speech recognition program causes the computer to function as a setting device for setting the correction value based on the noise level around where the spontaneous speech is uttered, wherein the calculation device calculates the likelihood by using the set correction value when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0055] According to the present invention, the correction value is set based on the noise level around where the spontaneous speech is uttered, and the likelihood is calculated based on the feature values of the extracted spontaneous speech, the extraneous-speech feature data adjusted by the set correction value, and the acquired keyword feature data.

[0056] Accordingly, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0057] In one aspect of the present invention, the speech recognition program causes the computer to function as a setting device for setting the correction value based on the ratio between the duration of the determined keyword and the duration of the spontaneous speech when the determining device determines at least one of the keywords to be recognized and the extraneous speech based on the calculated likelihood, wherein the calculation device calculates the likelihood by using the set correction value when the calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech feature data.

[0058] According to the present invention, the correction value is set based on the ratio between the duration of the determined keyword and the duration of the spontaneous speech, and the likelihood is calculated based on the feature values of the extracted spontaneous speech, the extraneous-speech feature data adjusted by the set correction value, and the acquired keyword feature data.

[0059] Accordingly, even under conditions in which misrecognition could occur due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0060] In one aspect of the present invention, in the speech recognition program, the extraneous-speech feature data prestored in the database has data of feature values of speech ingredient of a plurality of the extraneous speech.

[0061] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature values, the adjusted extraneous-speech feature data which has data of feature values of speech ingredient of a plurality of the extraneous speech, and the acquired keyword feature data.

[0062] Accordingly, since the likelihood is calculated based on data of feature values of speech ingredient of a plurality of the extraneous speech, it is possible to identify the extraneous speech properly using a small amount of data in recognizing the extraneous speech. Furthermore, even under conditions in which misrecognition could occur due to calculation error produced when the likelihood is calculated by using extraneous-speech feature data that combines characteristics of a plurality of feature values to reduce the amount of data, the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with the extraneous-speech component feature data can be adjusted by the set correction value, so that the keyword and the extraneous speech can be identified properly. Therefore, it is possible to prevent misrecognition and recognize the keyword reliably.

[0063] In one aspect of the present invention, in case where anextraneous-speech component feature data indicating feature value ofspeech ingredient of extraneous-speech component which is component ofthe extraneous speech is prestored in the database, the speechrecognition program causes the computer to function as: the calculationdevice for calculating likelihood based on the extraneous-speechcomponent feature data when the calculation device calculates thelikelihood which indicates probability that at least part of the featurevalues of the extracted spontaneous speech is matched with theextraneous-speech feature data, and the determining device fordetermining at least one of the keywords to be recognized and theextraneous-speech based on the calculated likelihood.

[0064] According to the present invention, the likelihood is calculated based on the extracted spontaneous-speech feature value, the adjusted extraneous-speech component feature data and the acquired keyword feature data, and at least one of the keywords to be recognized and the extraneous speech is determined based on the calculated likelihood.

[0065] Accordingly, since the extraneous speech and the keyword are identified by calculating the likelihood based on the adjusted extraneous-speech component feature data, the extraneous speech can be identified properly by using a small amount of data. Therefore, it is possible to increase the range of identifiable extraneous speech without increasing the amount of data required to recognize extraneous speech, and to improve the accuracy with which keywords are extracted and recognized.

[0066] Furthermore, misrecognition could occur because of the noise level around where the spontaneous speech is uttered, or because of, for example, calculation error produced when the likelihood is calculated using extraneous-speech feature data that combines the characteristics of a plurality of feature values to reduce the amount of data. Even under such conditions, the likelihood which indicates the probability that at least part of the feature values of the extracted spontaneous speech matches the extraneous-speech component feature data can be adjusted by the predetermined correction value, so the keyword and the extraneous speech can be identified properly.

[0067] Therefore, it is possible to prevent misrecognition and recognize keywords reliably.

BRIEF DESCRIPTION OF THE DRAWINGS

[0068] FIG. 1 is a diagram showing a speech recognition apparatus according to a first embodiment of the present invention, wherein an HMM-based speech language model is used;

[0069] FIG. 2 is a diagram showing an HMM-based speech language model for recognizing arbitrary spontaneous speech;

[0070] FIG. 3A shows graphs of the cumulative likelihood of an extraneous-speech HMM for an arbitrary combination of extraneous speech and a keyword;

[0071] FIG. 3B shows graphs of the cumulative likelihood of an extraneous-speech component HMM for an arbitrary combination of extraneous speech and a keyword;

[0072] FIG. 4 is an exemplary diagram showing how transitions take place in speech language model states when a correction value is added to or subtracted from the likelihood;

[0073] FIG. 5 is a diagram showing the configuration of a speech recognition apparatus according to the first embodiment of the present invention;

[0074] FIG. 6 is a flowchart showing the operation of a keyword recognition process according to the first embodiment;

[0075] FIG. 7 is a diagram showing the configuration of a speech recognition apparatus according to a second embodiment of the present invention; and

[0076] FIG. 8 is a flowchart showing the operation of a keyword recognition process according to the second embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0077] The present invention will now be described with reference to the preferred embodiments shown in the drawings.

[0078] The embodiments described below are embodiments in which the present invention is applied to a speech recognition apparatus.

[0079] Extraneous-speech components described in this embodiment represent basic phonetic units which compose speech, such as phonemes or syllables; syllables will be used in this embodiment for convenience of the following explanation.

[0080] [First Embodiment]

[0081] FIGS. 1 to 6 are diagrams showing a first embodiment of a speech recognition apparatus according to the present invention.

[0082] First, an HMM-based speech language model according to this embodiment will be described with reference to FIG. 1 and FIG. 2.

[0083] FIG. 1 is a diagram showing an HMM-based speech language model of a recognition network according to this embodiment, and FIG. 2 is a diagram showing a speech language model for recognizing arbitrary spontaneous speech using arbitrary HMMs.

[0084] This embodiment assumes a model (hereinafter referred to as a speech language model) which represents an HMM-based recognition network such as the one shown in FIG. 1, i.e., a speech language model 10 which contains keywords to be recognized.

[0085] The speech language model 10 consists of keyword models 11 connected at both ends with garbage models (hereinafter referred to as component models of extraneous-speech) 12a and 12b which represent components of extraneous speech. When a keyword contained in spontaneous speech is recognized, the keyword is identified by matching it with the keyword models 11, and extraneous speech contained in the spontaneous speech is identified by matching it with the component models of extraneous-speech 12a and 12b. In practice, the keyword models 11 and the component models of extraneous-speech 12a and 12b each represent a set of states through which transitions occur for each arbitrary segment of the spontaneous speech. The spontaneous speech is modeled by HMMs, which are statistical source models that represent an unsteady source as a combination of steady sources.

[0086] The HMMs of the keyword models 11 (hereinafter referred to as keyword HMMs) and the HMMs of the extraneous-speech component models 12a and 12b (hereinafter referred to as extraneous-speech component HMMs) have two types of parameter. One is a state transition probability, which represents the probability of a transition from one state to another. The other is an output probability, which represents the probability that a vector (the feature vector of each frame) will be observed when a state transitions from one state to another. Thus, the keyword HMMs represent the feature pattern of each keyword, and the extraneous-speech component HMMs 12a and 12b represent the feature pattern of each extraneous-speech component.

[0087] Generally, since even the same word or syllable shows acoustic variations for various reasons, the speech sounds composing spontaneous speech vary greatly with the speaker. However, even if uttered by different speakers, the same speech sound can be characterized mainly by a characteristic spectral envelope and its time variation. The stochastic characteristics of a time-series pattern of such acoustic variation can be expressed precisely by an HMM.

[0088] Thus, as described below, in this embodiment, keywords contained in the spontaneous speech are recognized by matching feature values of the inputted spontaneous speech with keyword HMMs and extraneous-speech HMMs and calculating likelihood.

[0089] Incidentally, the likelihood indicates the probability that feature values of the inputted spontaneous speech match the keyword HMMs and the extraneous-speech HMMs.

[0090] According to this embodiment, an HMM is a feature pattern of the speech ingredient of each keyword or a feature value of the speech ingredient of each extraneous-speech component. Furthermore, the HMM is a probability model which has spectral envelope data that represents power at each frequency at regular time intervals, or cepstrum data obtained from an inverse Fourier transform of the logarithm of the power spectrum.

[0091] Furthermore, the HMMs are created and stored beforehand in each database by acquiring spontaneous speech data of each phoneme uttered by multiple people, extracting the feature pattern of each phoneme, and learning feature pattern data of each phoneme based on the extracted feature patterns.

[0092] When keywords contained in spontaneous speech are recognized by using such HMMs, the spontaneous speech to be recognized is divided into segments of a predetermined duration and each segment is matched with the prestored data of each HMM. The probabilities of the state transitions of these segments from one state to another are then calculated based on the results of the matching process to identify the keywords to be recognized.

[0093] Specifically, in this embodiment, the feature value of each speech segment is compared with each feature pattern of the prestored HMM data, and the likelihood that the feature value of each speech segment matches the HMM feature patterns is calculated. The cumulative likelihood, which represents the probability of a connection among all HMMs, i.e., a connection between a keyword and extraneous speech, is calculated by using a matching process (described later), and the spontaneous speech is recognized by detecting the HMM connection with the highest likelihood.

[0094] The HMM, which represents an output probability of a feature vector, generally has two parameters: a state transition probability a and an output probability b, as shown in FIG. 2. The output probability of an inputted feature vector is given by a combined probability of a multidimensional normal distribution and the likelihood of each state is given by Eq. (1). $\begin{matrix}{b_{i}(x) = \frac{1}{\sqrt{(2\pi)^{P}\,|\Sigma_{i}|}}\exp\left( -\frac{1}{2}(x - \mu_{i})^{t}\,\Sigma_{i}^{-1}\,(x - \mu_{i}) \right)} & {{Eq}.\quad (1)}\end{matrix}$

[0095] Incidentally, x is the feature vector of an arbitrary speech segment, Σ_(i) is a covariance matrix, λ is a mixing ratio, μ_(i) is an average vector of feature vectors learned in advance, and P is the number of dimensions of the feature vector of the arbitrary speech segment.

[0096] FIG. 2 is a diagram showing a state transition probability a, which indicates the probability that an arbitrary state i changes to another state (i+n), and an output probability b with respect to the state transition probability a. Each graph in FIG. 2 shows an output probability that an inputted feature vector in a given state will be output.

[0097] Actually, logarithmic likelihood, which is the logarithm of Eq. (1) above, is often used for speech recognition, as shown in Eq. (2). $\begin{matrix}{\log b_{i}(x) = -\frac{1}{2}\log\left\lbrack (2\pi)^{P}\,|\Sigma_{i}| \right\rbrack - \frac{1}{2}(x - \mu_{i})^{t}\,\Sigma_{i}^{-1}\,(x - \mu_{i})} & {{Eq}.\quad (2)}\end{matrix}$
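The logarithmic likelihood of Eq. (2) can be illustrated with a minimal sketch. It assumes a diagonal covariance matrix, so that |Σ_(i)| reduces to the product of the variances and the inverse of Σ_(i) reduces to their reciprocals; the specification itself does not restrict Σ_(i) in this way. The function name `log_gaussian_diag` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Evaluate Eq. (2) for one state, assuming a diagonal covariance matrix.

    x    -- feature vector of one frame (length P)
    mean -- average vector mu_i learned in advance (length P)
    var  -- diagonal of the covariance matrix Sigma_i (length P, all > 0)
    """
    p = len(x)
    # log|Sigma_i| for a diagonal matrix is the sum of the log variances.
    log_det = sum(math.log(v) for v in var)
    # (x - mu_i)^t Sigma_i^{-1} (x - mu_i) for a diagonal matrix.
    mahalanobis = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det) - 0.5 * mahalanobis
```

Working in the logarithmic domain turns the products of per-frame probabilities into sums, which is one common reason for preferring Eq. (2) over Eq. (1) in practice.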

[0098] Next, an extraneous-speech component HMM, which is a garbage model, will be described with reference to FIG. 3.

[0099] FIG. 3 shows graphs of the cumulative likelihood of an extraneous-speech HMM and an extraneous-speech component HMM in an arbitrary combination of extraneous speech and a keyword.

[0100] As described above, in the case of a conventional speech recognition apparatus, extraneous-speech models are composed of HMMs which represent feature values of extraneous speech, as with keyword models. Therefore, to identify extraneous speech contained in spontaneous speech, the extraneous speech to be identified must be stored beforehand in a database.

[0101] The extraneous speech to be identified can include all speech except keywords, ranging from words which do not constitute keywords to unrecognizable speech with no linguistic content. Consequently, to recognize extraneous speech contained in spontaneous speech properly, HMMs must be prepared in advance for a huge volume of extraneous speech.

[0102] Thus, in the conventional speech recognition apparatus, data on the feature values of every extraneous speech must be acquired, for example by storing it in databases, to recognize extraneous speech contained in spontaneous speech properly. Accordingly, a huge amount of data must be stored in advance, but it is physically impossible to secure areas for storing the data.

[0103] Furthermore, in the conventional speech recognition apparatus, it takes a large amount of labor to generate the huge amount of data to be stored in databases or the like.

[0104] On the other hand, extraneous speech is also a type of speech, and thus it consists of components such as syllables and phonemes, which are generally limited in quantity.

[0105] Thus, if extraneous speech contained in spontaneous speech is identified based on the extraneous-speech components, it is possible to reduce the amount of data to be prepared as well as to identify every extraneous speech properly.

[0106] Specifically, since any extraneous speech can be composed by combining components such as syllables and phonemes, if extraneous speech is identified using data on such components prepared in advance, it is possible to reduce the amount of data to be prepared and identify every extraneous speech properly.

[0107] Generally, a speech recognition apparatus which recognizes keywords contained in spontaneous speech divides the spontaneous speech into speech segments at predetermined time intervals (as described later), and calculates the likelihood that the feature value of each speech segment matches a garbage model (such as an extraneous-speech HMM) or each keyword model (such as a keyword HMM) prepared in advance. It then accumulates the likelihood of each combination of a keyword and extraneous speech based on the likelihoods calculated for each speech segment against each extraneous-speech HMM and each keyword HMM, and thereby calculates the cumulative likelihood which represents HMM connections.

[0108] When extraneous-speech HMMs for recognizing the extraneous speech included in the spontaneous speech are not prepared in advance, as is the case with conventional speech recognition apparatus, the feature values of the portion of the spontaneous speech corresponding to extraneous speech show a low likelihood of a match with both extraneous-speech HMMs and keyword HMMs, as well as a low cumulative likelihood, which will cause misrecognition.

[0109] However, when speech segments are matched with an extraneous-speech component HMM, the feature values of extraneous speech in the spontaneous speech show a high likelihood of a match with the prepared data which represents the feature values of extraneous-speech component HMMs. Consequently, if the feature values of a keyword contained in the spontaneous speech match keyword HMM data, the cumulative likelihood of the combination of the keyword and the extraneous speech contained in the spontaneous speech is high, making it possible to recognize the keyword properly.

[0110] For example, when extraneous-speech HMMs which indicate garbage models of the extraneous speech contained in spontaneous speech are provided in advance, as shown in FIG. 3A, there is no difference in cumulative likelihood from the case where an extraneous-speech component HMM is used. However, when such extraneous-speech HMMs are not provided in advance, as shown in FIG. 3B, the cumulative likelihood is low compared with the case where an extraneous-speech component HMM is used.

[0111] Thus, since this embodiment calculates cumulative likelihood using the extraneous-speech component HMM and thereby identifies extraneous speech contained in spontaneous speech, it can identify the extraneous speech properly and recognize keywords using a small amount of data.

[0112] Next, with reference to FIG. 4, description will be given of how to adjust likelihoods by adding a correction value to the extraneous-speech component HMM according to this embodiment.

[0113] FIG. 4 is an exemplary diagram showing how transitions take place in speech language model states when a correction value is added to or subtracted from the likelihood.

[0114] According to this embodiment, when calculating the likelihood of a match between each feature data of the extraneous-speech component HMM prepared in advance and the feature value of each frame, a correction value is added to the likelihood.

[0115] Specifically, according to this embodiment, as shown in Eq. (3), the correction value α is added only to the likelihood of a match, given by Eq. (2) above, between the feature data of the extraneous-speech component HMM and the feature value of each frame. In this way, the probabilities which represent each likelihood are adjusted forcefully. $\begin{matrix}{\log\left\lbrack b_{i}(x) \right\rbrack = -\frac{1}{2}\log\left\lbrack (2\pi)^{P}\,|\Sigma_{i}| \right\rbrack - \frac{1}{2}(x - \mu_{i})^{t}\,\Sigma_{i}^{-1}\,(x - \mu_{i}) + \alpha} & {{Eq}.\quad (3)}\end{matrix}$

[0116] According to this embodiment, as described later, extraneous speech is identified by using an HMM which represents feature values of extraneous-speech components. Basically, a single extraneous-speech component HMM has the features of all components of extraneous speech, such as phonemes and syllables, and thus every extraneous speech is identified by using this extraneous-speech component HMM.

[0117] However, the extraneous-speech component HMM which covers all the components has a lower likelihood of a match to the extraneous-speech components composing the extraneous speech to be identified than do extraneous-speech component HMMs each of which has the feature value of only one component. Consequently, if this method is used in calculating cumulative likelihood over the entire spontaneous speech, a combination of extraneous speech and a keyword irrelevant to the spontaneous speech may be recognized.

[0118] In other words, the combination of extraneous speech and a keyword which should be recognized may have a lower cumulative likelihood than the one calculated for another combination of other extraneous speech and a keyword, resulting in misrecognition.

[0119] Therefore, as shown in Eq. (3) above, according to this embodiment, misrecognition is prevented by adding the correction value α only when the likelihood of the extraneous-speech component HMM is calculated and adjusting the calculated likelihood in such a way as to increase the likelihood of the appropriate combination of the extraneous-speech component HMM and keyword HMM over other combinations.

[0120] Specifically, as shown in FIG. 4, when the correction value α which is added in calculating the likelihood of the extraneous-speech component HMM is positive, the likelihood of a match between the feature vector of each frame of the spontaneous speech and the extraneous-speech component HMM becomes high. Consequently, the computational accuracy of likelihoods other than those of keyword HMMs increases during speech recognition of the spontaneous speech, making the speech recognition segments other than those for keywords longer than when the correction value α is not added.

[0121] Conversely, when the correction value α is negative, the likelihood of a match between the feature vector of each frame of the spontaneous speech and the extraneous-speech component HMM becomes low. Consequently, the computational accuracy of likelihoods other than those of keyword HMMs decreases during speech recognition of the spontaneous speech, making the speech recognition segments other than those for keywords shorter than when the correction value α is not added.
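The effect of the sign of α on the length of the non-keyword segments can be illustrated with a simplified sketch. It compares the corrected extraneous-speech log-likelihood of Eq. (3) with a keyword log-likelihood frame by frame; this frame-wise comparison is an assumption made for brevity and is not the matching process of the embodiment, which accumulates likelihood along HMM connections. The function name and the numerical values are hypothetical.

```python
def garbage_frame_count(keyword_ll, garbage_ll, alpha):
    """Count frames whose corrected garbage score exceeds the keyword score.

    keyword_ll -- per-frame log-likelihoods against a keyword HMM
    garbage_ll -- per-frame log-likelihoods against the garbage HMM (Eq. (2))
    alpha      -- correction value added only to the garbage score (Eq. (3))
    """
    return sum(1 for k, g in zip(keyword_ll, garbage_ll) if g + alpha > k)


# Hypothetical per-frame log-likelihoods for a five-frame utterance.
keyword_ll = [-12.0, -11.0, -6.0, -5.5, -9.0]
garbage_ll = [-10.0, -11.5, -7.0, -8.0, -9.5]

for alpha in (-3.0, 0.0, 1.0):
    print(alpha, garbage_frame_count(keyword_ll, garbage_ll, alpha))
```

With these values, a positive α assigns more frames to the extraneous-speech model than α = 0, and a negative α assigns fewer, which mirrors the lengthening and shortening of non-keyword segments described above.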

[0122] Therefore, in addition to generating the extraneous-speech component HMM of each frame, storing it in the garbage model database, and calculating its likelihood, according to this embodiment, misrecognition is prevented by adding the correction value α only when the likelihood of the extraneous-speech component HMM is calculated and adjusting the calculated likelihood in such a way as to increase the likelihood of the appropriate combination of the extraneous-speech component HMM and keyword HMM.

[0123] In this embodiment, as described later, the correction value α is set according to the noise level around where the spontaneous speech is uttered.

[0124] Next, the configuration of the speech recognition apparatus according to this embodiment will be described with reference to FIG. 5.

[0125] FIG. 5 is a diagram showing the configuration of the speech recognition apparatus according to the first embodiment of the present invention.

[0126] As shown in FIG. 5, the speech recognition apparatus 100 comprises: a microphone 101 which receives spontaneous speech and converts it into electrical signals (hereinafter referred to as speech signals); an input processor 102 which extracts, from the inputted speech signals, the speech signals that correspond to speech sounds and splits them into frames at a preset time interval; a speech analyzer 103 which extracts a feature value of the speech signal in each frame; a keyword model database 104 which prestores keyword HMMs which represent feature patterns of a plurality of keywords to be recognized; a garbage model database 105 which prestores the extraneous-speech component HMM which represents feature patterns of extraneous speech to be distinguished from the keywords; a first likelihood calculator 106 which calculates the likelihood that the extracted feature value of each frame matches the keyword HMMs; a second likelihood calculator 107 which calculates the likelihood that the extracted feature value of each frame matches the extraneous-speech component HMM; a correction processor 108 which makes corrections based on the noise level of collected surrounding sounds when the likelihood for each frame is calculated based on the feature value of the frame and the extraneous-speech component HMM; a matching processor 109 which performs a matching process (described later) based on the likelihoods calculated frame by frame for each HMM; and a determining device 110 which determines the keywords contained in the spontaneous speech based on the results of the matching process.

[0127] The speech analyzer 103 serves as the extraction device of the present invention, and the keyword model database 104 and garbage model database 105 serve as the storage device of the present invention. The first likelihood calculator 106 and second likelihood calculator 107 serve as the calculation device and acquisition device of the present invention, and the matching processor 109 and determining device 110 serve as the determining device of the present invention.

[0128] The speech signals outputted from the microphone 101 are inputted to the input processor 102. The input processor 102 extracts, from the inputted speech signals, those parts which represent speech segments of spontaneous speech, divides the extracted parts into frames of a predetermined duration, and outputs them to the speech analyzer 103.

[0129] For example, a frame has a duration of about 10 ms to 20 ms.
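The division into frames can be sketched as follows. The sketch assumes non-overlapping frames and discards any trailing samples that do not fill a whole frame; the specification states only that frames have a predetermined duration, so both choices, along with the function name and the 16 kHz sampling rate in the example, are assumptions made for illustration.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sequence of samples into consecutive frames of frame_ms each."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame duration is shorter than one sample")
    # Step through the signal one whole frame at a time; a trailing
    # partial frame is dropped (an assumption of this sketch).
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# Example: 1000 samples at an assumed 16 kHz rate with 10 ms frames.
frames = split_into_frames(list(range(1000)), 16000, frame_ms=10)
```

At 16 kHz a 10 ms frame holds 160 samples, so the 1000 samples in the example yield six full frames.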

[0130] The speech analyzer 103 analyzes the inputted speech signals frame by frame, extracts the feature value of the speech signal in each frame, and outputs it to the first likelihood calculator 106 and the second likelihood calculator 107.

[0131] Specifically, the speech analyzer 103 extracts, as the feature values of the speech ingredient on a frame-by-frame basis, spectral envelope data that represents power at each frequency at regular time intervals, or cepstrum data obtained from an inverse Fourier transform of the logarithm of the power spectrum. It converts the extracted feature values into vectors and outputs the vectors to the first likelihood calculator 106 and the second likelihood calculator 107.
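The cepstrum mentioned above, i.e., the inverse Fourier transform of the logarithm of the power spectrum, can be sketched with the standard library alone. The sketch uses a direct O(N²) discrete Fourier transform for clarity, which is an assumption made for self-containment; a practical analyzer would normally use a fast Fourier transform and typically apply a window first. The function names and the `floor` parameter, which guards against taking the logarithm of zero, are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Direct discrete Fourier transform (or its inverse) of a sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * math.pi * k * t / n)
        # The inverse transform carries the 1/N normalization.
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log power spectrum of one frame."""
    spectrum = dft(frame)
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]
```

For a unit impulse the power spectrum is flat, so every cepstral coefficient is zero; scaling the frame by a constant changes only the zeroth coefficient.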

[0132] The keyword model database 104 prestores keyword HMMs which represent pattern data of the feature values of the keywords to be recognized. The plurality of stored keyword HMMs represent patterns of the feature values of the plurality of keywords to be recognized.

[0133] For example, if the apparatus is used in a navigation system mounted in a mobile, the keyword model database 104 is designed to store HMMs which represent patterns of feature values of speech signals including destination names, present location names, or facility names such as restaurant names for the mobile.

[0134] As described above, according to this embodiment, an HMM which represents a feature pattern of the speech ingredient of each keyword is a probability model which has spectral envelope data that represents power at each frequency at regular time intervals, or cepstrum data obtained from an inverse Fourier transform of the logarithm of the power spectrum.

[0135] Since a keyword normally consists of a plurality of phonemes or syllables, as is the case with “present location” or “destination,” according to this embodiment, one keyword HMM consists of a plurality of keyword component HMMs, and the first likelihood calculator 106 calculates the likelihood between the frame-by-frame feature values and each keyword component HMM.

[0136] In this way, the keyword model database 104 stores the keyword HMMs of the keywords to be recognized, that is, the keyword component HMMs.

[0137] The garbage model database 105 prestores the extraneous-speech component HMM, which is a language model used to recognize the extraneous speech and which represents pattern data of feature values of extraneous-speech components.

[0138] According to this embodiment, the garbage model database 105 stores one HMM which represents feature values of extraneous-speech components. For example, if a syllable-based HMM is stored, this extraneous-speech component HMM contains feature patterns which cover the features of all syllables, such as those of the Japanese syllabary, nasals, voiced consonants, and plosive consonants.

[0139] Generally, to generate an HMM of a feature value for each syllable, speech data of each syllable uttered by multiple people is acquired in advance, the feature pattern of each syllable is extracted, and feature pattern data of each syllable is learned based on the syllable-based feature patterns. According to this embodiment, however, an HMM of all feature patterns is generated based on speech data of all syllables, so that a single HMM, which is a language model, represents the feature values of a plurality of syllables.

[0140] Thus, according to this embodiment, based on the generated feature pattern data, the single HMM, which is a language model and has the feature patterns of all syllables, is generated, converted into a vector, and prestored in the garbage model database 105.

[0141] The feature vector of each frame is inputted to the first likelihood calculator 106. Then, by comparing the feature values of each inputted frame with the feature values of the keyword HMMs stored in the keyword model database 104, the first likelihood calculator 106 calculates the likelihood of a match between each frame and each keyword HMM, and outputs the calculated likelihood to the matching processor 109.

[0142] According to this embodiment, the first likelihood calculator 106 calculates probabilities, including the probability of each frame corresponding to each HMM stored in the keyword model database 104, based on the feature values of each frame and the feature values of the HMMs stored in the keyword model database 104.

[0143] Specifically, the first likelihood calculator 106 calculates the output probability, which represents the probability of each frame corresponding to each keyword component HMM. Furthermore, it calculates the state transition probability, which represents the probability that a state transition from an arbitrary frame to the next frame matches a state transition from each keyword component HMM to another keyword component HMM or to an extraneous-speech component. Then, the first likelihood calculator 106 outputs these calculated probabilities as likelihoods to the matching processor 109.

[0144] Incidentally, state transition probabilities include probabilities of a state transition from a keyword component HMM to the same keyword component HMM as well.

[0145] Furthermore, the first likelihood calculator 106 outputs each output probability and each state transition probability calculated for each frame as the likelihood for each frame to the matching processor 109.

[0146] A correction value outputted by the correction processor 108 and the feature vector of each frame are inputted to the second likelihood calculator 107. Then, by comparing the feature values of the inputted frames with the feature value of the extraneous-speech component HMM stored in the garbage model database 105 and by adding the correction value, the second likelihood calculator 107 calculates the likelihood of a match between each frame and the extraneous-speech component HMM.

[0147] According to this embodiment, based on the feature value of each frame and the feature value of the extraneous-speech component HMM stored in the garbage model database 105, the second likelihood calculator 107 calculates the probability of each frame corresponding to the HMM stored in the garbage model database 105.

[0148] Specifically, the second likelihood calculator 107 calculates the output probability, which represents the probability of each frame corresponding to the extraneous-speech component HMM. Furthermore, it calculates the state transition probability, which represents the probability that a state transition from an arbitrary frame to the next frame matches a state transition from an extraneous-speech component to each keyword component HMM. Then, the second likelihood calculator 107 outputs these calculated probabilities as likelihoods to the matching processor 109.

[0149] Incidentally, state transition probabilities include probabilities of a state transition from an extraneous-speech component HMM to the same extraneous-speech component HMM as well.

[0150] The second likelihood calculator 107 outputs each output probability and each state transition probability calculated for each frame as the likelihood for each frame to the matching processor 109.

[0151] Surrounding sounds of the spontaneous speech collected by a microphone (not shown) are inputted to the correction processor 108. The correction processor 108 calculates a correction value based on the inputted surrounding sounds and outputs the correction value to the second likelihood calculator 107 to set the correction value therein.

[0152] For example, according to this embodiment, the correction value for the extraneous-speech component HMM is calculated based on the noise level of the collected surrounding sounds. Specifically, when the noise level is equal to or below −56 dB, the correction value α is given by Eq. (4).

α=β×(−0.10)  Eq. (4)

[0153] Incidentally, β represents the likelihood calculated by the extraneous-speech component HMM. When the noise level is −55 dB to −40 dB, the correction value α is given by Eq. (5).

α=β×(−0.05)  Eq. (5)

[0154] When the noise level is −39 dB to 0 dB, no correction is applied and a correction value of zero is set in the second likelihood calculator 107.
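The noise-dependent selection of the correction value in Eqs. (4) and (5) can be sketched as a small lookup. The specification lists the bands in whole decibels (−56 dB and below, −55 dB to −40 dB, −39 dB to 0 dB) and does not say how a fractional level such as −55.5 dB is treated; the sketch assumes each band extends down to the next threshold. The function name is hypothetical.

```python
def correction_value(noise_level_db, beta):
    """Return the correction value alpha for a given noise level.

    noise_level_db -- noise level of the collected surrounding sounds, in dB
    beta           -- likelihood calculated by the extraneous-speech
                      component HMM
    """
    if noise_level_db <= -56.0:
        return beta * -0.10  # Eq. (4)
    if noise_level_db <= -40.0:
        return beta * -0.05  # Eq. (5); fractional-level handling is assumed
    return 0.0               # -39 dB to 0 dB: no correction


# Example with a hypothetical log-likelihood beta = -20.0.
for level in (-60.0, -50.0, -10.0):
    print(level, correction_value(level, -20.0))
```

Because β is scaled by a negative factor, a negative logarithmic likelihood β yields a positive α in the quieter bands under this sketch.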

[0155] The frame-by-frame output probabilities and state transition probabilities are inputted to the matching processor 109. The matching processor 109 performs a matching process to calculate the cumulative likelihood, which is the likelihood of each combination of each keyword component HMM and the extraneous-speech component HMM, based on the inputted output probabilities and state transition probabilities, and outputs the cumulative likelihood to the determining device 110.

[0156] Specifically, the matching processor 109 calculates one cumulative likelihood for each keyword (as described later), and a cumulative likelihood without a keyword, i.e., the cumulative likelihood of the extraneous-speech component model alone.

[0157] Incidentally, details of the matching process performed by the matching processor 109 will be described later.

[0158] The cumulative likelihood of each keyword calculated by the matching processor 109 is inputted to the determining device 110. The determining device 110 determines the keyword with the highest cumulative likelihood to be a keyword contained in the spontaneous speech and outputs it externally.

[0159] In deciding on the keyword, the determining device 110 uses the cumulative likelihood of the extraneous-speech component model alone as well. If the extraneous-speech component model used alone has the highest cumulative likelihood, the determining device 110 determines that no keyword is contained in the spontaneous speech and outputs this result externally.
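The decision made by the determining device 110 can be sketched as follows. The sketch assumes the cumulative likelihoods are supplied as logarithmic scores keyed by keyword, and that a tie between the best keyword and the extraneous-speech-only score is resolved in favor of the keyword; neither detail is stated in the specification, and the function name and score values are hypothetical.

```python
def determine_keyword(keyword_scores, garbage_only_score):
    """Pick the keyword with the highest cumulative likelihood, or None.

    keyword_scores     -- dict mapping each keyword to its cumulative likelihood
    garbage_only_score -- cumulative likelihood of the extraneous-speech
                          component model alone
    """
    if not keyword_scores:
        return None
    best_keyword = max(keyword_scores, key=keyword_scores.get)
    # No keyword is reported when the garbage-only path scores highest.
    if garbage_only_score > keyword_scores[best_keyword]:
        return None
    return best_keyword


scores = {"present location": -40.0, "destination": -55.0}
print(determine_keyword(scores, -48.0))
print(determine_keyword(scores, -30.0))
```

Returning `None` stands in for the "no keyword is contained" result that the determining device outputs externally.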

[0160] Next, description will be given of the matching process performed by the matching processor 109 according to this embodiment.

[0161] The matching process according to this embodiment calculates the cumulative likelihood of each combination of a keyword model and an extraneous-speech component model using the Viterbi algorithm.

[0162] The Viterbi algorithm calculates the cumulative likelihood based on the output probability of entering each given state and the transition probability of transitioning from each state to another state, and then outputs the combination for which the cumulative likelihood has been calculated.

[0163] Generally, the cumulative likelihood is obtained by first calculating the Euclidean distance between the feature value of each frame and the feature value of the state represented by each HMM, and then accumulating these distances into a cumulative distance.

[0164] Specifically, the Viterbi algorithm calculates cumulative probability along a path which represents a transition from an arbitrary state i to a next state j, and thereby extracts each path, i.e., the connections and combinations of HMMs, through which state transitions can take place.
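
For illustration, a generic log-domain Viterbi recursion over states i and j can be sketched as follows. The representation of the model as lists of log probabilities and the names `log_prob` and `viterbi` are assumptions made for this sketch; the embodiment does not specify a data layout, and the use of logarithms (so that probabilities are added rather than multiplied) is likewise an assumption.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def log_prob(p: float) -> float:
    """Logarithm of a probability, mapping 0 to negative infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(log_init: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """Return (best cumulative log likelihood, best state path).

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of a transition from state i to j
    log_emit[t][j]  : log output probability of frame t in state j
    """
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back_pointers: List[List[int]] = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state i for arriving in state j at frame t.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
            pointers.append(best_i)
        score = new_score
        back_pointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

Only the single best predecessor is kept for each state at each frame, which is why one cumulative likelihood per model combination remains at the last frame.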

[0165] In this embodiment, the first likelihood calculator 106 and the second likelihood calculator 107 calculate the output probabilities and state transition probabilities by matching the keyword models or the extraneous-speech component model against the frames of the inputted spontaneous speech one by one, beginning with the first divided frame and ending with the last divided frame. From these, the matching process calculates the cumulative likelihood of an arbitrary combination of a keyword model and extraneous-speech components from the first divided frame to the last divided frame, determines, for each keyword model, the arrangement which has the highest cumulative likelihood among the keyword model/extraneous-speech component combinations, and outputs the determined cumulative likelihoods of the keyword models one by one to the determining device 110.

[0166] For example, in a case where the keywords to be recognized are “present location” and “destination” and the inputted spontaneous speech is “er, present location”, the matching process according to this embodiment is performed as follows.

[0167] It is assumed here that the extraneous speech is “er,” that the garbage model database 105 contains one extraneous-speech component HMM which represents features of all extraneous-speech components, that the keyword model database 104 contains HMMs of each syllable of “present” and “destination,” and that the output probabilities and state transition probabilities calculated by the first likelihood calculator 106 and the second likelihood calculator 107 have already been inputted to the matching processor 109.

[0168] In such a case, according to this embodiment, the Viterbi algorithm calculates the cumulative likelihood of all arrangements in each combination of the keyword and extraneous-speech components for the keywords “present” and “destination” based on the output probabilities and state transition probabilities.

[0169] Specifically, when an arbitrary spontaneous speech is inputted, cumulative likelihoods of the following patterns of each combination are calculated based on the output probabilities and state transition probabilities: “p-r-e-se-n-t ####,” “# p-r-e-se-n-t ###,” “## p-r-e-se-n-t ##,” “### p-r-e-se-n-t #,” and “#### p-r-e-se-n-t” for the keyword of “present” and “d-e-s-t-i-n-a-ti-o-n ####,” “# d-e-s-t-i-n-a-ti-o-n ###,” “## d-e-s-t-i-n-a-ti-o-n ##,” “### d-e-s-t-i-n-a-ti-o-n #,” and “#### d-e-s-t-i-n-a-ti-o-n” for the keyword of “destination” (where # indicates an extraneous-speech component).
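
The arrangement patterns listed above can be enumerated mechanically. The following sketch is illustrative only: the function name `arrangements` is hypothetical, and the total of four extraneous-speech components is taken from the example and is not a property of the apparatus.

```python
from typing import List, Sequence


def arrangements(keyword_units: Sequence[str], n_garbage: int) -> List[str]:
    """List every placement of one keyword among n_garbage '#' components.

    The keyword units stay contiguous and in order; only the number of
    extraneous-speech components before and after the keyword varies.
    """
    keyword = "-".join(keyword_units)
    patterns = []
    for before in range(n_garbage + 1):
        after = n_garbage - before
        parts = ["#" * before, keyword, "#" * after]
        # Drop empty parts so no stray spaces appear at either end.
        patterns.append(" ".join(part for part in parts if part))
    return patterns


if __name__ == "__main__":
    for pattern in arrangements(["p", "r", "e", "se", "n", "t"], 4):
        print(pattern)
```

With four extraneous-speech components this yields five arrangements per keyword, one for each possible position of the keyword.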

[0170] The Viterbi algorithm calculates the cumulative likelihoods of all combination patterns over all the frames of the spontaneous speech, beginning with the first frame, for each keyword, in this case “present location” and “destination.”

[0171] Furthermore, in the process of calculating the cumulative likelihoods of each arrangement for each keyword, the Viterbi algorithm stops calculation halfway for those arrangements which have a low cumulative likelihood, determining that the spontaneous speech does not match those combination patterns.

[0172] Specifically, in the first frame, either the likelihood of the HMM of “p,” which is a keyword component HMM of the keyword “present location,” or the likelihood of the extraneous-speech component HMM is included in the calculation of the cumulative likelihood. In this case, the candidate with the higher cumulative likelihood is carried forward to the calculation of the next cumulative likelihood. In the above example, the likelihood of the extraneous-speech component HMM is higher than the likelihood of the HMM of “p,” and thus calculation of the cumulative likelihood for “p-r-e-se-n-t ####” is terminated after “p.”
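
The embodiment does not state the criterion by which an arrangement is judged to have a low cumulative likelihood. One common choice, shown here purely as an assumption, is beam pruning, in which arrangements whose cumulative log likelihood falls more than a fixed margin below the current best are dropped. The function name `prune` and the parameter `beam_width` are hypothetical.

```python
from typing import Dict


def prune(hypotheses: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep only arrangements within beam_width of the best score.

    hypotheses maps an arrangement label to its cumulative log likelihood
    so far. beam_width is an assumed tuning parameter, not a value given
    in the source.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {label: score for label, score in hypotheses.items()
            if score >= best - beam_width}
```

Applied after the first frame of “er, present location”, an arrangement that begins with “p” would be expected to fall outside the beam, while arrangements that begin with an extraneous-speech component would survive.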

[0173] Thus, in this type of matching process, only one cumulative likelihood is calculated for each of the keywords “present location” and “destination.”

[0174] Next, a keyword recognition process according to this embodiment will be described with reference to FIG. 6.

[0175] FIG. 6 is a flowchart showing operation of the keyword recognition process according to this embodiment.

[0176] First, when a control panel or controller (not shown) instructs each part to start a keyword recognition process and spontaneous speech is inputted to the microphone 101 (Step S11), the input processor 102 extracts speech signals of the spontaneous speech from the inputted speech signals (Step S12), divides the extracted speech signals into frames of a predetermined duration, and outputs them to the speech analyzer 103 frame by frame (Step S13).

[0177] Then, the following processes are performed on a frame-by-frame basis.

[0178] First, the speech analyzer 103 extracts the feature value of the inputted speech signal in each frame, and outputs it to the first likelihood calculator 106 and second likelihood calculator 107 (Step S14).

[0179] Specifically, based on the speech signal in each frame, the speech analyzer 103 extracts, as the feature values of the speech ingredient, either spectral envelope data that represents power at each frequency at regular time intervals or cepstrum data obtained from an inverse Fourier transform of the logarithm of the power spectrum, converts the extracted feature values into vectors, and outputs the vectors to the first likelihood calculator 106 and second likelihood calculator 107.
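
As an illustration of the cepstrum computation named above (inverse Fourier transform of the logarithm of the power spectrum), the following sketch uses a naive discrete Fourier transform from the Python standard library. The function names `dft` and `real_cepstrum` and the flooring constant `floor` are assumptions made here; a practical analyzer would normally use a fast Fourier transform and windowing, neither of which is shown.

```python
import cmath
import math
from typing import List, Sequence


def dft(samples: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(samples)
    return [sum(samples[k] * cmath.exp(-2j * cmath.pi * m * k / n)
                for k in range(n))
            for m in range(n)]


def real_cepstrum(frame: Sequence[float], floor: float = 1e-12) -> List[float]:
    """Cepstrum of one frame: inverse DFT of the log power spectrum.

    floor guards against log(0) for empty spectral bins (an assumed
    safeguard, not taken from the source).
    """
    n = len(frame)
    spectrum = dft(frame)
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in spectrum]
    # Inverse DFT; the result is real for a real input frame, so only the
    # real part is kept.
    return [sum(log_power[m] * cmath.exp(2j * cmath.pi * m * k / n)
                for m in range(n)).real / n
            for k in range(n)]
```

The returned list can be used directly as the feature vector of the frame, or truncated to its first few coefficients.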

[0180] Next, the first likelihood calculator 106 compares the feature value of the inputted frame with the feature values of each HMM stored in the keyword model database 104, calculates the output probability and state transition probability of the frame with respect to each HMM (as described above), and outputs the calculated output probabilities and state transition probabilities to the matching processor 109 (Step S15).

[0181] Next, the second likelihood calculator 107 compares the feature value of the inputted frame with the feature value of the extraneous-speech component HMM stored in the garbage model database 105, and calculates the output probability and state transition probability of the frame with respect to the extraneous-speech component HMM (as described above) (Step S16).

[0182] Then, the second likelihood calculator 107 obtains the correction value calculated in advance by the correction processor 108 using the method described above, adds the correction value to the output probability and state transition probability of the frame with respect to the extraneous-speech component HMM, and outputs the resulting output probability and state transition probability (with the correction value) to the matching processor 109 (Step S17).

[0183] Next, the matching processor 109 calculates the cumulative likelihood of each keyword in the matching process described above (Step S18).

[0184] Specifically, the matching processor 109 integrates the likelihoods of each keyword HMM and the extraneous-speech component HMM, and eventually calculates only the highest cumulative likelihood for each keyword.

[0185] Then, at the instruction of the controller (not shown), the matching processor 109 determines whether the given frame is the last divided frame (Step S19). If the matching processor 109 determines that it is the last divided frame, it outputs the highest cumulative likelihood for each keyword to the determining device 110 (Step S20). If the frame is not the last divided frame, operation returns to Step S14.

[0186] Finally, based on the cumulative likelihood of each keyword, the determining device 110 externally outputs the keyword with the highest cumulative likelihood as the keyword contained in the spontaneous speech (Step S21). This concludes the operation.

[0187] Thus, according to this embodiment, since keywords and spontaneous speech are identified properly based on the stored extraneous-speech component feature data, the extraneous speech can be identified properly using a small amount of data, making it possible to increase identifiable extraneous speech without increasing the amount of data needed for recognition of extraneous speech and to improve the accuracy with which keywords are extracted and recognized.

[0188] Specifically, when the garbage model is generated with feature values of speech ingredients of a plurality of extraneous words, relatively low likelihoods of each HMM are accumulated over the entire spontaneous speech during speech recognition. Consequently, the combination of the extraneous-speech HMM and the keyword HMM to be recognized may have a lower cumulative likelihood than the one calculated for another combination of a different keyword HMM and the extraneous-speech HMM which is matched accidentally. In that case, surrounding sounds such as noise around where the spontaneous speech is uttered may cause misrecognition if they are loud enough to be picked up by the speech recognition apparatus.

[0189] However, according to this embodiment, since the likelihood of a match between the extracted spontaneous-speech feature values and the extraneous-speech feature HMM is calculated by using a preset correction value and at least either the keywords to be recognized or the extraneous speech contained in the spontaneous speech is determined based on the calculated likelihood, identifiable extraneous speech can be increased without increasing the amount of data needed for recognition of extraneous speech, and the accuracy with which keywords are extracted and recognized is improved.

[0190] Furthermore, according to this embodiment, since the likelihood of a match between the extracted spontaneous-speech feature values and the extraneous-speech feature HMM is calculated by using a preset correction value, the calculated likelihood can be adjusted.

[0191] Consequently, even under conditions in which misrecognition could occur due to the noise level around where the spontaneous speech is uttered or due to calculation error produced when preparing extraneous-speech feature data by combining characteristics of a plurality of feature values to reduce the amount of data, the likelihood of a match between the extracted spontaneous-speech feature values and the extraneous-speech feature data can be adjusted by using a correction value. This makes it possible to identify the extraneous speech and keywords properly, which in turn makes it possible to prevent misrecognition and recognize keywords reliably.

[0192] Incidentally, although extraneous-speech component models are generated based on syllables according to this embodiment, of course, they may be generated based on phonemes or other units.

[0193] Furthermore, although one extraneous-speech component HMM is stored in the garbage model database 105 according to this embodiment, an HMM which represents feature values of extraneous-speech components may be stored for each group of phonemes of a given type, or separately for vowels and consonants.

[0194] In this case, the likelihood calculation process computes, on a frame-by-frame basis, the likelihood of each extraneous-speech component with respect to its extraneous-speech component HMM.

[0195] Furthermore, although the keyword recognition process is performed by the speech recognition apparatus described above according to this embodiment, the speech recognition apparatus may be equipped with a computer and recording medium, and a similar keyword recognition process may be performed as the computer reads a keyword recognition program stored on the recording medium.

[0196] Here, a DVD or CD may be used as the recording medium.

[0197] In this case, the speech recognition apparatus will be equipped with a reading device for reading the program from the recording medium.

[0198] Although, according to this embodiment, the correction value added to the likelihood of a match between the extraneous-speech component HMM and the feature values of frames is based on the noise level of surrounding sounds around where the spontaneous speech is uttered, it is also possible to use a correction value calculated empirically in advance.

[0199] In this case, for example, the correction value is obtained by multiplying the likelihood calculated in a normal manner by ±0.10. Thus, the correction value α is given by Eq. (6).

α=β×(±0.10)  Eq. (6)

[0200] Incidentally, β represents the likelihood calculated by the extraneous-speech component HMM.
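
Eq. (6) and the addition of α to the likelihood β can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `empirical_correction` and `apply_correction` are hypothetical, the choice between the plus and minus sign is left to the caller because the text does not say how it is selected, and β is treated as a plain number without assuming whether it is a probability or a log likelihood.

```python
def empirical_correction(beta: float, sign: int = 1,
                         factor: float = 0.10) -> float:
    """Correction value alpha = beta * (+/-0.10), following Eq. (6).

    sign must be +1 or -1; how the sign is chosen is not specified in the
    source and is left to the caller.
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    return beta * sign * factor


def apply_correction(beta: float, alpha: float) -> float:
    """Add the correction value to the extraneous-speech likelihood."""
    return beta + alpha
```

For a hypothetical β of -50.0 and the plus sign, α is -5.0 and the corrected likelihood is -55.0; with the minus sign, α is 5.0 and the corrected likelihood is -45.0.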

[0201] [Second Embodiment]

[0202] FIGS. 7 to 8 are diagrams showing a speech recognition apparatus according to a second embodiment of the present invention.

[0203] This embodiment differs from the first embodiment in that the correction value is calculated by using the word length of the keyword to be recognized, i.e., the length ratio between the keyword contained in the spontaneous speech and the spontaneous speech, instead of being set based on the noise level of the surrounding sounds collected by the correction processor. In other respects, the configuration of this embodiment is similar to that of the first embodiment. Thus, the same components as those in the first embodiment are denoted by the same reference numerals as the corresponding components, and description thereof will be omitted.

[0204] First, the configuration of the speech recognition apparatus according to this embodiment will be described with reference to FIG. 7.

[0205] As shown in FIG. 7, the speech recognition apparatus 200 comprises a microphone 101, input processor 102, speech analyzer 103, keyword model database 104, garbage model database 105, first likelihood calculator 106, second likelihood calculator 107, correction processor 120 which makes corrections based on the lengths of the keyword and spontaneous speech when calculating likelihood for each frame based on the feature value of the frame and the extraneous-speech component HMM, matching processor 109, and determining device 110.

[0206] The keyword length acquired by the determining device 110 and the length of the spontaneous speech acquired by the input processor 102 are inputted to the correction processor 120. The correction processor 120 calculates the ratio of the keyword length to the length of the spontaneous speech, calculates a correction value based on the calculated ratio, and outputs the calculated correction value to the second likelihood calculator 107.

[0207] Specifically, when the length ratio is 0% to 39%, the correction value α is given by Eq. (7).

α=β×(−0.10)  Eq. (7)

[0208] Incidentally, β represents the likelihood calculated by the extraneous-speech component HMM. When the length ratio is 40% to 74%, no correction value is used.

[0209] When the length ratio is 75% to 100%, the correction value α is given by Eq. (8).

α=β×0.10  Eq. (8)

[0210] These correction values are output to the second likelihood calculator 107.
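
The mapping from the length ratio to the correction value in Eqs. (7) and (8) can be sketched as follows. The function name `length_ratio_correction` is hypothetical, and the treatment of ratios that fall between the stated ranges (for example 39.5%) is an assumption: the boundaries are taken here as half-open intervals at 40% and 75%.

```python
def length_ratio_correction(keyword_length: float, speech_length: float,
                            beta: float) -> float:
    """Correction value alpha chosen from the keyword/speech length ratio.

    ratio < 40%         : alpha = beta * (-0.10)   (Eq. (7))
    40% <= ratio < 75%  : no correction (alpha = 0)
    ratio >= 75%        : alpha = beta * 0.10      (Eq. (8))
    The half-open boundaries are an assumption; the source states the
    ranges as 0-39%, 40-74%, and 75-100%.
    """
    if speech_length <= 0:
        raise ValueError("speech_length must be positive")
    ratio = 100.0 * keyword_length / speech_length
    if ratio < 40.0:
        return beta * -0.10
    if ratio < 75.0:
        return 0.0
    return beta * 0.10
```

Both lengths only need to be expressed in the same unit (for example frames or seconds), since only their ratio is used.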

[0211] Next, a keyword recognition process according to this embodiment will be described with reference to FIG. 8.

[0212] FIG. 8 is a flowchart showing operation of the keyword recognition process according to this embodiment.

[0213] First, when a control panel or controller (not shown) instructs each part to start a keyword recognition process and spontaneous speech is inputted to the microphone 101 (Step S31), the input processor 102 extracts speech signals of the spontaneous speech from the inputted speech signals (Step S32), divides the extracted speech signals into frames of a predetermined duration, and outputs them to the speech analyzer 103 frame by frame (Step S33).

[0214] Then, the following processes are performed on a frame-by-frame basis.

[0215] First, the speech analyzer 103 extracts the feature value of the speech signal in each frame, and outputs it to the first likelihood calculator 106 and second likelihood calculator 107 (Step S34).

[0216] Specifically, based on the speech signal in each frame, the speech analyzer 103 extracts, as the feature values of the speech ingredient, either spectral envelope data that represents power at each frequency at regular time intervals or cepstrum data obtained from an inverse Fourier transform of the logarithm of the power spectrum, converts the extracted feature values into vectors, and outputs the vectors to the first likelihood calculator 106 and second likelihood calculator 107.

[0217] Next, the first likelihood calculator 106 compares the feature value of the inputted frame with the feature values of each HMM stored in the keyword model database 104, calculates the output probability and state transition probability of the frame with respect to each HMM (as described above), and outputs the calculated output probabilities and state transition probabilities to the matching processor 109 (Step S35).

[0218] Next, the second likelihood calculator 107 compares the feature value of the inputted frame with the feature value of the extraneous-speech component HMM stored in the garbage model database 105, and thereby calculates the output probability and state transition probability of the frame with respect to the extraneous-speech component HMM (as described above) (Step S36).

[0219] Then, the second likelihood calculator 107 obtains the correction value calculated in advance by the correction processor 120 using the method described above, adds the correction value to the output probability and state transition probability of the frame with respect to the extraneous-speech component HMM, and outputs the resulting output probability and state transition probability to the matching processor 109 (Step S37).

[0220] The matching processor 109 calculates the cumulative likelihood of each keyword in the matching process described above (Step S38).

[0221] Specifically, the matching processor 109 integrates the likelihoods of each inputted keyword HMM and the extraneous-speech component HMM, and eventually calculates only the highest cumulative likelihood for each keyword.

[0222] Then, at the instruction of the controller (not shown), the matching processor 109 determines whether the given frame is the last divided frame (Step S39). If it is determined to be the last divided frame, the matching processor 109 outputs the highest cumulative likelihood calculated for each keyword to the determining device 110 (Step S40). If the frame is not the last divided frame, operation returns to Step S34.

[0223] Then, based on the cumulative likelihood of each keyword, the determining device 110 outputs the keyword with the highest cumulative likelihood as the keyword contained in the spontaneous speech (Step S41).

[0224] Next, the correction processor 120 obtains the length of the spontaneous speech from the input processor 102 and the keyword length from the determining device 110 and calculates the ratio of the keyword length to the length of the spontaneous speech (Step S42).

[0225] Finally, based on the calculated ratio of the keyword length to the length of the spontaneous speech, the correction processor 120 calculates the correction value described above (Step S43), and stores it for use in the next operation. This concludes the current operation.
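
Steps S42 and S43 imply that the correction derived from one utterance is held and applied during the next recognition operation. A minimal sketch of that feedback is given below. The class name `LengthRatioCorrector`, the decision to store the multiplying factor rather than α itself (since α depends on the per-frame likelihood β), the zero factor before the first utterance, and the half-open boundaries at 40% and 75% are all assumptions made for illustration.

```python
class LengthRatioCorrector:
    """Holds the correction factor between recognition operations."""

    def __init__(self) -> None:
        # Assumption: no correction is applied before any utterance has
        # been recognized.
        self.factor = 0.0

    def update(self, keyword_length: float, speech_length: float) -> None:
        """Steps S42-S43: derive and store the factor for the next run."""
        if speech_length <= 0:
            raise ValueError("speech_length must be positive")
        ratio = 100.0 * keyword_length / speech_length
        if ratio < 40.0:
            self.factor = -0.10
        elif ratio < 75.0:
            self.factor = 0.0
        else:
            self.factor = 0.10

    def correct(self, beta: float) -> float:
        """Step S37: add alpha = beta * factor to the likelihood beta."""
        return beta + beta * self.factor
```

In use, `correct` would be called for each frame of the current utterance, and `update` once after the keyword has been determined.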

[0226] Thus, according to this embodiment, since keywords and spontaneous speech are identified properly based on the stored extraneous-speech component feature data, the extraneous speech can be identified properly by using a small amount of data, making it possible to increase identifiable extraneous speech without increasing the amount of data needed for recognition of extraneous speech and to improve the accuracy with which keywords are extracted and recognized.

[0227] Furthermore, according to this embodiment, since the likelihood of a match between the extracted spontaneous-speech feature values and the extraneous-speech feature HMM is calculated by using a preset correction value, the calculated likelihood can be adjusted.

[0228] Consequently, even under conditions in which misrecognition could occur due to calculation error produced when preparing extraneous-speech feature data by combining characteristics of a plurality of feature values to reduce the amount of data, the likelihood of a match between the extracted spontaneous-speech feature values and the extraneous-speech feature data can be adjusted by using a correction value. This makes it possible to identify the extraneous speech and keywords properly, which in turn makes it possible to prevent misrecognition and recognize keywords reliably.

[0229] Incidentally, although extraneous-speech component models are generated based on syllables according to this embodiment, of course, they may be generated based on phonemes or other units.

[0230] Furthermore, although one extraneous-speech component HMM is stored in the garbage model database 105 according to this embodiment, an HMM which represents feature values of extraneous-speech components may be stored for each group of phonemes of a given type, or separately for vowels and consonants.

[0231] In that case, the likelihood calculation process computes, on a frame-by-frame basis, the likelihood of each extraneous-speech component with respect to its extraneous-speech component HMM.

[0232] Furthermore, although the keyword recognition process is performed by the speech recognition apparatus described above according to this embodiment, the speech recognition apparatus may be equipped with a computer and recording medium, and a similar keyword recognition process may be performed as the computer reads a keyword recognition program stored on the recording medium.

[0233] On the speech recognition apparatus which executes the keyword recognition program, a DVD or CD may be used as the recording medium.

[0234] In that case, the speech recognition apparatus will be equipped with a reading device for reading the program from the recording medium.

[0235] The entire disclosure of Japanese Patent Application No. 2002-114632 filed on Apr. 17, 2002 including the specification, claims, drawings and summary is incorporated herein by reference in its entirety.

What is claimed is:
 1. A speech recognition apparatus for recognizing at least one of keywords contained in uttered spontaneous speech, comprising: an extraction device for extracting a spontaneous-speech feature value, which is feature value of speech ingredient of the spontaneous speech, by analyzing the spontaneous speech; a database in which at least one of keyword feature data indicating feature value of speech ingredient of said keyword and at least one of an extraneous-speech feature data indicating feature value of speech ingredient of extraneous-speech is prestored, a calculation device for calculating likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said keyword feature data and said extraneous-speech feature data; and a determining device for determining at least one of said keywords to be recognized and said extraneous-speech based on the calculated likelihood, wherein the calculation device calculates the likelihood by using a predetermined correction value when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 2. The speech recognition apparatus according to claim 1, further comprising a setting device for setting the correction value based on noise level around where the spontaneous speech is uttered, and wherein the calculation device calculates the likelihood by using the set correction value when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 3. The speech recognition apparatus according to claim 1, further comprising a setting device for setting the correction value based on the ratio between duration of the determined keyword and duration of the spontaneous speech when the determining device determines at least one of said keywords to be recognized and said extraneous speech based on the calculated likelihood, and wherein said calculation device calculates the likelihood by using the set correction value when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 4. The speech recognition apparatus according to claim 1, wherein said extraneous-speech feature data prestored in said database has data of feature values of speech ingredient of a plurality of the extraneous-speech.
 5. The speech recognition apparatus according to claim 1, in case where an extraneous-speech component feature data indicating feature value of speech ingredient of extraneous-speech component which is component of the extraneous speech is prestored in said database, wherein: said calculation device for calculating likelihood based on said extraneous-speech component feature data when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data, and said determining device for determining at least one of said keywords to be recognized and said extraneous-speech based on the calculated likelihood.
 6. A speech recognition method of recognizing at least one of keywords contained in uttered spontaneous speech, comprising: an extraction process of extracting a spontaneous-speech feature value, which is feature value of speech ingredient of the spontaneous speech, by analyzing the spontaneous speech; an acquiring process of acquiring at least one of keyword feature data indicating feature value of speech ingredient of said keyword and at least one of an extraneous-speech feature data indicating feature value of speech ingredient of extraneous-speech, said keyword feature data and extraneous-speech feature data prestoring in a database; a calculation process of calculating likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said keyword feature data and said extraneous-speech feature data; and a determination process of determining at least one of said keywords to be recognized and said extraneous-speech based on the calculated likelihood, wherein said calculation process calculates the likelihood by using a predetermined correction value when said calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 7. The speech recognition method according to claim 6, further comprising a setting process of setting the correction value based on noise level around where the spontaneous speech is uttered, and wherein said calculation process calculates the likelihood by using the set correction value when said calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 8. The speech recognition method according to claim 6, further comprising a setting process of setting the correction value based on the ratio between duration of the determined keyword and duration of the spontaneous speech when the determination process determines at least one of said keywords to be recognized and said extraneous speech based on the calculated likelihood, and wherein said calculation process calculates the likelihood by using the set correction value when said calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 9. The speech recognition method according to claim 6, wherein said extraneous-speech feature data prestored in said database has data of feature values of speech ingredient of a plurality of the extraneous-speech.
 10. The speech recognition method according to claim, in case where an extraneous-speech component feature data indicating feature value of speech ingredient of extraneous-speech component which is component of the extraneous speech is prestored in said database, wherein: said calculation process of calculating likelihood based on said extraneous-speech component feature data when said calculation process calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data, and said determination process of determining at least one of said keywords to be recognized and said extraneous-speech based on the calculated likelihood.
 11. A recording medium wherein a speech recognition program is recorded so as to be read by a computer, the computer included in a speech recognition apparatus for recognizing at least one of keywords contained in uttered spontaneous speech, the program causing the computer to function as: an extraction device for extracting a spontaneous-speech feature value, which is feature value of speech ingredient of the spontaneous speech, by analyzing the spontaneous speech; an acquiring device for acquiring at least one of keyword feature data indicating feature value of speech ingredient of said keyword and at least one of an extraneous-speech feature data indicating feature value of speech ingredient of extraneous-speech, said keyword feature data and extraneous-speech feature data prestoring in a database; a calculation device for calculating likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said keyword feature data and said extraneous-speech feature data; and a determining device for determining at least one of said keywords to be recognized and said extraneous-speech based on the calculated likelihood, wherein said calculation device calculates the likelihood by using a predetermined correction value when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 12. The recording medium according to claim 11, wherein the program further causes the computer to function as a setting device for setting the correction value based on noise level around where the spontaneous speech is uttered, and wherein said calculation device calculates the likelihood by using the set correction value when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 13. The recording medium according to claim 11, wherein the program further causes the computer to function as a setting device for setting the correction value based on the ratio between duration of the determined keyword and duration of the spontaneous speech when the determining device determines at least one of said keywords to be recognized and said extraneous speech based on the calculated likelihood, and wherein said calculation device calculates the likelihood by using the set correction value when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data.
 14. The recording medium according to claim 11, wherein the program further causes the computer to function as said extraneous-speech feature data prestored in said database has data of feature values of speech ingredient of a plurality of the extraneous-speech.
 15. The recording medium according to claim 11, in case where an extraneous-speech component feature data indicating feature value of speech ingredient of extraneous-speech component which is component of the extraneous speech is prestored in said database, wherein the program further causes the computer to function as: said calculation device for calculating likelihood based on said extraneous-speech component feature data when said calculation device calculates the likelihood which indicates probability that at least part of the feature values of the extracted spontaneous speech is matched with said extraneous-speech feature data, and said determining device for determining at least one of said keywords to be recognized and said extraneous-speech based on the calculated likelihood.